M6 Lab

Matthew Præstgaard

Spring 2024

Lab Setup

Note: some results may not show up properly in the default (dark) theme

R packages

Load necessary packages.

library(tidyverse)
library(here)
library(MASS)
library(pscl)
## Warning: package 'pscl' was built under R version 4.3.3
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.3
library(car)
library(ggpubr)
library(kableExtra)
library(ggthemr)
ggthemr("flat dark")
library(plotly)
library(jtools)
library(ggh4x)
library(broom)

source(here("Functions", "ggHist2.r"))
source(here("Functions", "summariseVar.r"))
source(here("Functions", "ggResid.r"))

Data management

The data were obtained from a survey aimed at investigating the factors associated with the number of physician office visits over the past two years. The survey comprises interviews with 2,500 people in the United States and includes demographics, health conditions, and insurance information for each interviewee.

# read in data
data <- read_csv(here("Lab6", "hw6.csv")) %>%
  drop_na() %>%
  mutate(
    healthStat = factor(health - 1)
  )
# turn health status into a factor variable so it doesn't get treated as a numeric 1, 2, 3

1. Visualize Data

Generate a histogram of the outcome variable and comment on the shape. Summarize the outcome variable using summary statistics. Based on these two analyses, is linear regression appropriate? What type of regression do you think is necessary to answer the remaining questions? Why?

# histogram of outcome
histVisit <- ggHist2(data, visit, bins = 35, title = "Histogram of Number of Visits")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplotly(histVisit)
# summary statistics
sumVisit <- summarize.var(data, visit)
kable(sumVisit)
mean median min max q25 q75 sd variance n IQR
6.5512 5 0 89 2 9 7.253663 52.61562 2500 7

Side note: it would be great if ggplot/tidyverse/Posit stopped changing their notation guidelines/standards and preferred arguments every three months, so I didn’t have to see a warning about something benign all the time

Description: The data are heavily right-skewed, as seen in the histogram, and the standard deviation (7.25) exceeds the mean (6.55). Since the outcome variable is a count, linear regression is not appropriate; a Poisson or negative binomial model would be appropriate instead. The variance (52.62) being roughly eight times the mean suggests overdispersion, making negative binomial regression the more appropriate of the two.
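This overdispersion can be quantified directly from the summary statistics above; for a Poisson-distributed outcome, the variance-to-mean ratio should be near 1:

```r
# variance-to-mean ratio from the summary statistics above;
# a Poisson outcome would give a ratio near 1
var_visit  <- 52.61562
mean_visit <- 6.5512
round(var_visit / mean_visit, 2)
## ~8.03, well above 1: strong evidence of overdispersion
```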

2. Poisson model

Fit a Poisson model that estimates the expected number of physician office visits adjusting for health status, sex, race, a condition limiting activities of daily living, private insurance information, age, chronic conditions, and education. Show the output and interpret the coefficient for the chronic conditions variable. Which variables appear to be associated with the outcome?

# Poisson model
# glm(y ~ x1 + x2 + ..., data = data_name, family = poisson(link = "log"))
visitPois <- glm(visit ~ healthStat + sex + race + adldiff + privins + age + cond + edu, data = data, family = poisson(link = "log"))

# show model output
summ(visitPois)
Observations 2500
Dependent variable visit
Type Generalized linear model
Family poisson
Link log
χ²(9) 1605.67
Pseudo-R² (Cragg-Uhler) 0.47
Pseudo-R² (McFadden) 0.07
AIC 20996.56
BIC 21054.80
Est. S.E. z val. p
(Intercept) 2.28 0.11 21.55 0.00
healthStat1 -0.30 0.02 -13.65 0.00
healthStat2 -0.58 0.04 -13.37 0.00
sex -0.04 0.02 -2.27 0.02
race -0.07 0.03 -2.26 0.02
adldiff 0.13 0.02 6.34 0.00
privins -0.13 0.02 -5.66 0.00
age -0.08 0.01 -6.12 0.00
cond 0.14 0.01 24.67 0.00
edu 0.02 0.00 9.11 0.00
Standard errors: MLE
summary(visitPois)
## 
## Call:
## glm(formula = visit ~ healthStat + sex + race + adldiff + privins + 
##     age + cond + edu, family = poisson(link = "log"), data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.276253   0.105647  21.546  < 2e-16 ***
## healthStat1 -0.303845   0.022258 -13.651  < 2e-16 ***
## healthStat2 -0.578324   0.043254 -13.370  < 2e-16 ***
## sex         -0.036828   0.016198  -2.274   0.0230 *  
## race        -0.066327   0.029364  -2.259   0.0239 *  
## adldiff      0.129884   0.020500   6.336 2.36e-10 ***
## privins     -0.126695   0.022390  -5.659 1.53e-08 ***
## age         -0.081083   0.013255  -6.117 9.53e-10 ***
## cond         0.142288   0.005768  24.670  < 2e-16 ***
## edu          0.021805   0.002394   9.107  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 14493  on 2499  degrees of freedom
## Residual deviance: 12888  on 2490  degrees of freedom
## AIC: 20997
## 
## Number of Fisher Scoring iterations: 5
kable(exp(coef(visitPois)))
x
(Intercept) 9.7401154
healthStat1 0.7379750
healthStat2 0.5608378
sex 0.9638419
race 0.9358249
adldiff 1.1386968
privins 0.8810026
age 0.9221173
cond 1.1529092
edu 1.0220442

Interpretation: All of the variables appear to be associated with the outcome (all p < 0.05). For each additional chronic condition, the expected log count of visits increases by 0.14; equivalently, the expected number of visits is multiplied by exp(0.142) ≈ 1.15, about a 15% increase, holding the other covariates constant.
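The rate-ratio reading of this coefficient follows directly from exponentiating the printed estimate (a quick check using the value from the summary output above):

```r
# exponentiate the cond coefficient from the Poisson fit above
exp(0.142288)
## [1] 1.152909 -- each additional chronic condition multiplies the
## expected visit count by ~1.15 (a ~15% increase), all else held fixed
```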

3. Model Assumptions

Evaluate the fit of this model. Are the assumptions for Poisson model met?

# evaluate model assumptions
Deviance <- deviance(visitPois)
X2Pois <- sum(resid(visitPois, type = "pearson")^2)
df <- visitPois$df.residual

kable(vif(visitPois))
GVIF Df GVIF^(1/(2*Df))
healthStat 1.357668 2 1.079440
sex 1.015633 1 1.007786
race 1.127484 1 1.061830
adldiff 1.311183 1 1.145069
privins 1.214384 1 1.101991
age 1.106567 1 1.051935
cond 1.239751 1 1.113441
edu 1.178882 1 1.085763
influenceIndexPlot(visitPois, c("Cook", "student", "hat"))

Deviance / df
## [1] 5.175817
X2Pois / df
## [1] 7.060422
pchisq(Deviance, df = visitPois$df.residual, lower.tail = FALSE)
## [1] 0
pchisq(X2Pois, df = visitPois$df.residual, lower.tail = FALSE)
## [1] 0

Interpretation: The assumptions for Poisson regression are not met. The most glaring violation is the discrepancy between the mean and the variance: both calculated 𝜙 values (deviance/df ≈ 5.18, Pearson χ²/df ≈ 7.06) are far above 1, and the goodness-of-fit tests reject the model (p ≈ 0). In addition, the diagnostic plots show a number of highly influential, potentially problematic points.

4. Consequences of unmet assumptions

What are some of the consequences of overdispersion in a Poisson model? That is, if there is overdispersion in the model that isn’t accounted for, what can go wrong? What are some ways we can account for overdispersion to make our conclusions more valid?

Response: Unaccounted-for overdispersion leaves the point estimates consistent but makes the standard errors too small, so confidence intervals are too narrow and p-values too optimistic, invalidating inference based on the model. It can be remedied by adjusting the standard errors (e.g., quasi-Poisson or robust/sandwich errors) or by fitting a model that allows the variance to exceed the mean, such as negative binomial regression.
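Besides the quasi-Poisson fit below, another common adjustment is a robust (sandwich) variance estimator, which keeps the Poisson point estimates but corrects the standard errors; a sketch, assuming the `sandwich` package is installed:

```r
# robust (sandwich) standard errors for the Poisson model;
# point estimates are unchanged, SEs are inflated to reflect overdispersion
library(sandwich)
lmtest::coeftest(visitPois, vcov = sandwich::vcovHC(visitPois, type = "HC0"))
```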

quasiPois <- glm(visit ~ healthStat + sex + race + adldiff + privins + age + cond + edu, data = data, family = quasipoisson(link = "log"))
summ(quasiPois)
## Note: Pseudo-R2 for quasibinomial/quasipoisson families is calculated by
## refitting the fitted and null models as binomial/poisson.
Observations 2500
Dependent variable visit
Type Generalized linear model
Family quasipoisson
Link log
χ²(9) 1605.67
Pseudo-R² (Cragg-Uhler) 0.47
Pseudo-R² (McFadden) 0.07
AIC NA
BIC NA
Est. S.E. t val. p
(Intercept) 2.28 0.28 8.11 0.00
healthStat1 -0.30 0.06 -5.14 0.00
healthStat2 -0.58 0.11 -5.03 0.00
sex -0.04 0.04 -0.86 0.39
race -0.07 0.08 -0.85 0.40
adldiff 0.13 0.05 2.38 0.02
privins -0.13 0.06 -2.13 0.03
age -0.08 0.04 -2.30 0.02
cond 0.14 0.02 9.28 0.00
edu 0.02 0.01 3.43 0.00
Standard errors: MLE
sink("quasiPois.txt")
summary(quasiPois)
## 
## Call:
## glm(formula = visit ~ healthStat + sex + race + adldiff + privins + 
##     age + cond + edu, family = quasipoisson(link = "log"), data = data)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.276253   0.280721   8.109 7.95e-16 ***
## healthStat1 -0.303845   0.059143  -5.137 3.00e-07 ***
## healthStat2 -0.578324   0.114933  -5.032 5.20e-07 ***
## sex         -0.036828   0.043041  -0.856 0.392274    
## race        -0.066327   0.078024  -0.850 0.395362    
## adldiff      0.129884   0.054471   2.384 0.017179 *  
## privins     -0.126695   0.059493  -2.130 0.033303 *  
## age         -0.081083   0.035221  -2.302 0.021411 *  
## cond         0.142288   0.015325   9.284  < 2e-16 ***
## edu          0.021805   0.006362   3.427 0.000619 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 7.060462)
## 
##     Null deviance: 14493  on 2499  degrees of freedom
## Residual deviance: 12888  on 2490  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
sink()

As shown, the quasi-Poisson model yields larger standard errors than the unadjusted model while leaving the point estimates unchanged; sex and race are no longer statistically significant once the overdispersion is accounted for.
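This is exactly what quasi-Poisson does: every standard error is the Poisson SE scaled by the square root of the estimated dispersion. Checking the `sex` coefficient against the two outputs above:

```r
# quasi-Poisson SE = Poisson SE * sqrt(estimated dispersion)
phi_hat <- 7.060462   # dispersion parameter from the quasi-Poisson summary
se_pois <- 0.016198   # Poisson SE for sex
round(se_pois * sqrt(phi_hat), 6)
## 0.043041, matching the quasi-Poisson SE for sex
```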

5. Negative Binomial Model

Fit a negative binomial regression model using the same variables as the Poisson model above. Show the model output. Are there any differences between the trends found in this model and the trends found in the Poisson model in part 2? Why do you think that is?

# fit negative binomial model
# from the MASS package: glm.nb(y ~ x1 + x2 + ...., data = data_name)
visitNB <- MASS::glm.nb(visit ~ healthStat + sex + race + adldiff + privins + age + cond + edu, data = data)

# show model output
summ(visitNB)
## Error in glm.control(...) : 
##   unused argument (family = list("Negative Binomial(1.5975)", "log", function (mu) 
## log(mu), function (eta) 
## pmax(exp(eta), .Machine$double.eps), function (mu) 
## mu + mu^2/.Theta, function (y, mu, wt) 
## 2 * wt * (y * log(pmax(1, y)/mu) - (y + .Theta) * log((y + .Theta)/(mu + .Theta))), function (y, n, mu, wt, dev) 
## {
##     term <- (y + .Theta) * log(mu + .Theta) - y * log(mu) + lgamma(y + 1) - .Theta * log(.Theta) + lgamma(.Theta) - lgamma(.Theta + y)
##     2 * sum(term * wt)
## }, function (eta) 
## pmax(exp(eta), .Machine$double.eps), expression({
##     if (any(y < 0)) stop("negative values not allowed for the negative binomial family")
##     n <- rep(1, nobs)
##     mustart <- y + (y == 0)/6
## }), function (mu) 
## all(mu > 0), function (eta) 
## TRUE, function (object, nsim) 
## {
##     ftd <- fitted(object)
##     rnegbin(nsim * length(ftd), ftd, .Theta)
## }))
## Warning: Something went wrong when calculating the pseudo R-squared. Returning NA
## instead.
Observations 2500
Dependent variable visit
Type Generalized linear model
Family Negative Binomial(1.5975)
Link log
χ²(NA) NA
Pseudo-R² (Cragg-Uhler) NA
Pseudo-R² (McFadden) NA
AIC 14388.16
BIC 14452.23
Est. S.E. z val. p
(Intercept) 2.15 0.24 8.93 0.00
healthStat1 -0.35 0.06 -6.12 0.00
healthStat2 -0.61 0.09 -6.68 0.00
sex -0.04 0.04 -1.02 0.31
race -0.08 0.06 -1.23 0.22
adldiff 0.13 0.05 2.70 0.01
privins -0.14 0.05 -2.87 0.00
age -0.06 0.03 -2.14 0.03
cond 0.16 0.01 11.24 0.00
edu 0.02 0.01 4.33 0.00
Standard errors: MLE
# saving to text file for reference
sink("visitNB.txt")
summary(visitNB)
## 
## Call:
## MASS::glm.nb(formula = visit ~ healthStat + sex + race + adldiff + 
##     privins + age + cond + edu, data = data, init.theta = 1.597506752, 
##     link = log)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.145003   0.240269   8.928  < 2e-16 ***
## healthStat1 -0.345198   0.056382  -6.122 9.21e-10 ***
## healthStat2 -0.611589   0.091562  -6.680 2.40e-11 ***
## sex         -0.037381   0.036748  -1.017  0.30905    
## race        -0.079038   0.064090  -1.233  0.21749    
## adldiff      0.132677   0.049097   2.702  0.00689 ** 
## privins     -0.143212   0.049923  -2.869  0.00412 ** 
## age         -0.064386   0.030022  -2.145  0.03198 *  
## cond         0.158599   0.014112  11.239  < 2e-16 ***
## edu          0.023324   0.005386   4.330 1.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.5975) family taken to be 1)
## 
##     Null deviance: 3029.3  on 2499  degrees of freedom
## Residual deviance: 2704.4  on 2490  degrees of freedom
## AIC: 14388
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.5975 
##           Std. Err.:  0.0565 
## 
##  2 x log-likelihood:  -14366.1610
sink()

Response: The standard errors of the negative binomial model are larger than those of the Poisson model, so sex and race are no longer statistically significant, while the point estimates are roughly the same. This is expected: the negative binomial model accounts for the overdispersion the Poisson model ignores, widening the standard errors without much change to the fitted means.
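A side-by-side table makes the comparison concrete (a sketch using `broom`, which is already loaded):

```r
# compare estimates and standard errors from the Poisson and NB fits;
# the dplyr:: prefix avoids the select() masking from MASS
seComp <- dplyr::left_join(
  broom::tidy(visitPois) %>%
    dplyr::select(term, est_pois = estimate, se_pois = std.error),
  broom::tidy(visitNB) %>%
    dplyr::select(term, est_nb = estimate, se_nb = std.error),
  by = "term"
)
kable(seComp, digits = 3)
```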

6. LRT

Run a likelihood ratio test to compare the Poisson and negative binomial models. What does this test tell us?

# likelihood ratio test
odTest(visitNB)
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
## 
## Critical value of test statistic at the alpha= 0.05 level: 2.7055 
## Chi-Square Test Statistic =  6610.3991 p-value = < 2.2e-16

Response: With a test statistic of 𝛘² = 6610.399 and p < 2.2e-16, we reject the Poisson restriction: there is significant overdispersion, the dispersion parameter is nonzero, and the negative binomial model is preferred.
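The odTest statistic can be reproduced by hand from the two log-likelihoods; because the dispersion parameter sits on the boundary of its parameter space under H0, the reference distribution is non-standard and the χ²(1) p-value is halved:

```r
# likelihood-ratio statistic: Poisson is the restricted NB model
lr_stat <- as.numeric(2 * (logLik(visitNB) - logLik(visitPois)))
lr_stat                                           # ~6610.4
pchisq(lr_stat, df = 1, lower.tail = FALSE) / 2   # boundary-corrected p-value
```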

7. Model fit metrics

Compare the Poisson and negative binomial models using other metrics of model fit (e.g. log-likelihood, deviance, AIC, BIC, pseudo R^2). You don’t have to do all of these, but check a few. Pick the model you think is best.

AIC1 <- AIC(visitPois)
BIC1 <- BIC(visitPois)

AIC2 <- AIC(visitNB)
BIC2 <- BIC(visitNB)
Deviance2 <- deviance(visitNB)

diagPois <- tibble(AIC1, BIC1, Deviance)
kable(diagPois, caption = "Diagnostic Values of Poisson Model")
Diagnostic Values of Poisson Model
AIC1 BIC1 Deviance
20996.56 21054.8 12887.78
diagNB <- tibble(AIC2, BIC2, Deviance2)
kable(diagNB, caption = "Diagnostic Values of Negative-Binomial Model")
Diagnostic Values of Negative-Binomial Model
AIC2 BIC2 Deviance2
14388.16 14452.23 2704.434
ggResid(visitPois, title = "Residuals vs Fitted Values (Poisson)") %>%
  ggplotly()
ggResid(visitNB, title = "Residuals vs Fitted Values (Negative-Binomial)") %>%
  ggplotly()

8. Difference in log expected count

Using the model you picked, how much higher would the log of the expected number of visits be for someone with 1 chronic health condition compared to someone with none?

# you can do this by hand using the coefficients printed out by summary
# you can also grab the coefficients from the model using model_name$coefficients and manipulating this vector the get the values you want

\[\log(\text{Expected Count}) = \beta_0 + \beta_1 \times \text{cond} + \dots\]
\[\log(\text{Expected Count}) = 2.1450 + 0.1586 \times \text{cond} + \dots\]
\[\log(\text{Expected Count}_{\text{cond}=1}) - \log(\text{Expected Count}_{\text{cond}=0}) = \beta_1 \times (1 - 0)\]
\[\log(\text{Expected Count}_{\text{cond}=1}) - \log(\text{Expected Count}_{\text{cond}=0}) = 0.1586 \times 1 = 0.1586\]

As the coefficient for health conditions in the NB model is 0.159, someone with one chronic health condition would have a log expected number of visits 0.159 higher than someone with none, holding all else equal.
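The same number can be pulled straight from the fitted model, as the comment in the chunk above suggests:

```r
# difference in log expected visits: cond = 1 vs cond = 0, all else equal
unname(coef(visitNB)["cond"] * (1 - 0))   # ~0.1586
```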

9. Expected count

Exponentiate the number you got in part 8 and interpret this value in terms of the expected number of visits, comparing the person with 1 chronic condition to the person with no chronic conditions.

exp(0.158599)
## [1] 1.171868

Interpretation: Someone with one chronic condition would have 1.17 times as many expected visits as someone with none (about 17% more), holding all else equal.

10. Modeling effect modification

Using the model you picked above, add an interaction term between private insurance and chronic conditions. Using the model outcome, calculate the same comparisons as in parts 8 and 9 for people with insurance and for people without insurance. Interpret the coefficients for chronic conditions, private insurance, and their interaction in this model.

# model with interaction
NBInter <- glm.nb(visit ~ healthStat + sex + race + adldiff + privins * cond + age + edu, data = data)
# show output
summ(NBInter)
## Error in glm.control(...) : 
##   unused argument (family = list("Negative Binomial(1.5984)", "log", function (mu) 
## log(mu), function (eta) 
## pmax(exp(eta), .Machine$double.eps), function (mu) 
## mu + mu^2/.Theta, function (y, mu, wt) 
## 2 * wt * (y * log(pmax(1, y)/mu) - (y + .Theta) * log((y + .Theta)/(mu + .Theta))), function (y, n, mu, wt, dev) 
## {
##     term <- (y + .Theta) * log(mu + .Theta) - y * log(mu) + lgamma(y + 1) - .Theta * log(.Theta) + lgamma(.Theta) - lgamma(.Theta + y)
##     2 * sum(term * wt)
## }, function (eta) 
## pmax(exp(eta), .Machine$double.eps), expression({
##     if (any(y < 0)) stop("negative values not allowed for the negative binomial family")
##     n <- rep(1, nobs)
##     mustart <- y + (y == 0)/6
## }), function (mu) 
## all(mu > 0), function (eta) 
## TRUE, function (object, nsim) 
## {
##     ftd <- fitted(object)
##     rnegbin(nsim * length(ftd), ftd, .Theta)
## }))
## Warning: Something went wrong when calculating the pseudo R-squared. Returning NA
## instead.
Observations 2500
Dependent variable visit
Type Generalized linear model
Family Negative Binomial(1.5984)
Link log
χ²(NA) NA
Pseudo-R² (Cragg-Uhler) NA
Pseudo-R² (McFadden) NA
AIC 14389.05
BIC 14458.94
Est. S.E. z val. p
(Intercept) 2.15 0.24 8.95 0.00
healthStat1 -0.35 0.06 -6.14 0.00
healthStat2 -0.61 0.09 -6.68 0.00
sex -0.04 0.04 -1.00 0.32
race -0.08 0.06 -1.25 0.21
adldiff 0.13 0.05 2.68 0.01
privins -0.20 0.08 -2.70 0.01
cond 0.15 0.02 9.65 0.00
age -0.06 0.03 -2.10 0.04
edu 0.02 0.01 4.30 0.00
privins:cond 0.03 0.03 1.08 0.28
Standard errors: MLE
sink("NBInter.txt")
summary(NBInter)
## 
## Call:
## glm.nb(formula = visit ~ healthStat + sex + race + adldiff + 
##     privins * cond + age + edu, data = data, init.theta = 1.598398003, 
##     link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.150202   0.240244   8.950  < 2e-16 ***
## healthStat1  -0.345855   0.056373  -6.135 8.51e-10 ***
## healthStat2  -0.611947   0.091561  -6.684 2.33e-11 ***
## sex          -0.036873   0.036747  -1.003  0.31565    
## race         -0.080268   0.064103  -1.252  0.21051    
## adldiff       0.131583   0.049142   2.678  0.00741 ** 
## privins      -0.204568   0.075794  -2.699  0.00695 ** 
## cond          0.150834   0.015628   9.652  < 2e-16 ***
## age          -0.063086   0.030031  -2.101  0.03567 *  
## edu           0.023190   0.005387   4.305 1.67e-05 ***
## privins:cond  0.033858   0.031480   1.076  0.28213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.5984) family taken to be 1)
## 
##     Null deviance: 3030.5  on 2499  degrees of freedom
## Residual deviance: 2704.4  on 2489  degrees of freedom
## AIC: 14389
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.5984 
##           Std. Err.:  0.0566 
## 
##  2 x log-likelihood:  -14365.0480
sink()

exp(coef(NBInter))
##  (Intercept)  healthStat1  healthStat2          sex         race      adldiff 
##    8.5865950    0.7076152    0.5422941    0.9637983    0.9228688    1.1406321 
##      privins         cond          age          edu privins:cond 
##    0.8149992    1.1628034    0.9388627    1.0234608    1.0344373

Chronic Conditions: For individuals without private insurance, each additional chronic condition is associated with a ~16.3% increase in the expected count of physician visits (0.151 log visits), holding all other variables constant.
Private Insurance: For individuals with no chronic conditions, having private insurance is associated with a ~18.5% decrease in expected visits (-0.205 log visits), holding all else constant.
Interaction Term: For those with private insurance, each additional chronic condition multiplies the expected number of office visits by exp(0.151 + 0.034) ≈ 1.20, a ~20.3% increase, compared to the ~16.3% increase for those without, holding all else constant.
The p-value for the interaction term (0.282) suggests that this effect modification is not statistically significant.
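The combined effect for the insured group, with a delta-method-style standard error taken from the coefficient covariance matrix, can be computed directly (a sketch against the `NBInter` fit above):

```r
# rate ratio per additional chronic condition among the privately insured:
# exp(beta_cond + beta_interaction), with SE from the covariance matrix
b <- coef(NBInter)
V <- vcov(NBInter)
est <- b["cond"] + b["privins:cond"]
se  <- sqrt(V["cond", "cond"] + V["privins:cond", "privins:cond"] +
              2 * V["cond", "privins:cond"])
exp(est)                         # ~1.20
exp(est + c(-1.96, 1.96) * se)   # 95% CI on the rate-ratio scale
```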

11. Testing for Effect modification

Test if the association between chronic conditions and number of visits is modified by private insurance status (using the model fit in part 10). Interpret the results of this test.

# likelihood ratio test
anova(visitNB, NBInter)
## Likelihood ratio tests of Negative Binomial Models
## 
## Response: visit
##                                                            Model    theta
## 1 healthStat + sex + race + adldiff + privins + age + cond + edu 1.597507
## 2 healthStat + sex + race + adldiff + privins * cond + age + edu 1.598398
##   Resid. df    2 x log-lik.   Test    df LR stat.   Pr(Chi)
## 1      2490       -14366.16                                
## 2      2489       -14365.05 1 vs 2     1  1.11291 0.2914502

Interpretation: With p = 0.291, we do not have evidence at the 0.05 significance level that private insurance status modifies the relationship between the number of chronic conditions and the expected count of physician visits.

12. Checking model fit

Assess the model fit from the model you picked. Are there any problematic points that you’re worried about?

# check model fit/problematic points
AIC3 <- AIC(NBInter)
BIC3 <- BIC(NBInter)
Deviance3 <- deviance(NBInter)
R2Int <- pR2(NBInter)
## fitting null model for pseudo-r2
# fix: tibble() columns must be vectors, so the model object itself
# cannot go in the table; keep only the scalar diagnostics
diagNBInt <- tibble(AIC3, BIC3, Deviance3, McFadden = R2Int[["McFadden"]])
kable(diagNBInt, caption = "Diagnostic Values of Interaction Term Negative-Binomial Model")
influenceIndexPlot(NBInter, c("Cook", "student", "hat"))

Interpretation: There are some points of concern. In the Cook’s distance plot, a couple of points (477 and 1284) are drastically higher than the rest, suggesting they are highly influential. A few points also have much larger hat-values than the rest, indicating high leverage and potential influence.

13. Unequal follow-up

Suppose that instead of asking people how many times they had visited a physician’s office in the past two years, we ran a cohort study to track the number of physician office visits per person, but people are followed for unequal periods of time (some people get enrolled later, some people move, etc.). How could we have handled this variability in follow-up time across patients in our regression models?

Response: This variability could be handled by modeling rates (visits per unit of follow-up time) instead of raw counts, which is accomplished by including an offset of log person-time in the model.
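As a sketch, assuming a hypothetical `followup_years` column (not in the actual data) recording each person’s time under observation, the offset version of the chosen model would be:

```r
# hypothetical follow-up variable; offset(log(time)) turns counts into rates
visitNBrate <- MASS::glm.nb(
  visit ~ healthStat + sex + race + adldiff + privins + age + cond + edu +
    offset(log(followup_years)),
  data = data
)
```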

14. Zero-Inflation

Are you concerned at all about zero-inflation in the model you chose? Why or why not?

# checking for zero-inflation
performance::check_zeroinflation(NBInter)
## # Check for zero-inflation
## 
##    Observed zeros: 141
##   Predicted zeros: 211
##             Ratio: 1.50
## Model is overfitting zeros.
performance::check_zeroinflation(visitNB)
## # Check for zero-inflation
## 
##    Observed zeros: 141
##   Predicted zeros: 210
##             Ratio: 1.49
## Model is overfitting zeros.

Response: No. In both models the predicted zeros (≈210) exceed the observed zeros (141), giving ratios of about 1.5; the negative binomial models over-predict zeros rather than under-predict them. Zero-inflation would show the opposite pattern (many more observed zeros than the model predicts), so a zero-inflated model is not needed here.
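For reference, if the check had gone the other way (many more observed zeros than predicted), a zero-inflated model would be the usual fix; a sketch using `pscl::zeroinfl`, which is already loaded, with private insurance as an illustrative (assumed, not from the lab) predictor of the zero component:

```r
# zero-inflated negative binomial: count model | zero-inflation model
# the zero-component predictor (privins) is illustrative only
visitZINB <- pscl::zeroinfl(
  visit ~ healthStat + sex + race + adldiff + privins + age + cond + edu | privins,
  data = data, dist = "negbin"
)
summary(visitZINB)
```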

15. Summary

Summarize your findings.

Summary: In this lab, we examined the association between the expected number of physician office visits and multiple predictors. A Poisson model was fit first, with all covariates found to be statistically significantly associated with the outcome. This model did not meet the Poisson assumptions, however, so a negative binomial model was used instead; under it, sex and race were no longer statistically significant. A likelihood ratio test further confirmed that the Poisson model was not appropriate, and the negative binomial model’s smaller AIC and BIC values also favored it. Influential and high-leverage points were tested for and discovered, potentially reducing the accuracy of the model. An interaction term between private insurance and number of chronic conditions was added, but it was not statistically significant. Neither model showed evidence of zero-inflation; both actually over-predicted zeros.

sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26090)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rlang_1.1.3      broom_1.0.5      ggh4x_0.2.8      jtools_2.2.2    
##  [5] plotly_4.10.4    ggthemr_1.1.0    kableExtra_1.4.0 ggpubr_0.6.0    
##  [9] car_3.1-2        carData_3.0-5    lmtest_0.9-40    zoo_1.8-12      
## [13] pscl_1.5.9       MASS_7.3-60.0.1  here_1.0.1       lubridate_1.9.3 
## [17] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.0.2     
## [21] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_3.4.4   
## [25] tidyverse_2.0.0  knitr_1.45      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0    viridisLite_0.4.2   fastmap_1.1.1      
##  [4] lazyeval_0.2.2      digest_0.6.34       timechange_0.3.0   
##  [7] lifecycle_1.0.4     ellipsis_0.3.2      survival_3.5-7     
## [10] magrittr_2.0.3      compiler_4.3.2      sass_0.4.8         
## [13] tools_4.3.2         utf8_1.2.4          yaml_2.3.8         
## [16] data.table_1.14.10  ggsignif_0.6.4      labeling_0.4.3     
## [19] fitdistrplus_1.1-11 htmlwidgets_1.6.4   bit_4.0.5          
## [22] xml2_1.3.6          abind_1.4-5         withr_3.0.0        
## [25] grid_4.3.2          fansi_1.0.6         colorspace_2.1-0   
## [28] scales_1.3.0        insight_0.19.7      cli_3.6.2          
## [31] rmarkdown_2.25      crayon_1.5.2        generics_0.1.3     
## [34] performance_0.10.8  rstudioapi_0.15.0   httr_1.4.7         
## [37] tzdb_0.4.0          cachem_1.0.8        pander_0.6.5       
## [40] splines_4.3.2       parallel_4.3.2      rmdformats_1.0.4   
## [43] vctrs_0.6.5         Matrix_1.6-5        jsonlite_1.8.8     
## [46] bookdown_0.37       hms_1.1.3           bit64_4.0.5        
## [49] rstatix_0.7.2       systemfonts_1.0.5   crosstalk_1.2.1    
## [52] jquerylib_0.1.4     glue_1.7.0          stringi_1.8.3      
## [55] gtable_0.3.4        munsell_0.5.0       pillar_1.9.0       
## [58] htmltools_0.5.7     R6_2.5.1            rprojroot_2.0.4    
## [61] vroom_1.6.5         evaluate_0.23       lattice_0.22-5     
## [64] highr_0.10          backports_1.4.1     bslib_0.6.1        
## [67] Rcpp_1.0.12         svglite_2.1.3       xfun_0.41          
## [70] pkgconfig_2.0.3